Multiple correspondence analysis as a tool for examining Nobel Prize data from 1901 to 2018

The main goal of this paper is to examine Nobel Prize data by studying the association among the laureate’s country of birth or residence, discipline, time period in which the Nobel Prize was awarded, and gender of the recipient. Multiple correspondence analysis is used as a tool to examine the association between these four categorical variables by cross-classifying them in the form of a four-way contingency table. The data that we examine comprise Nobel Prize recipients from 1901 to 2018 (inclusive) from eight developed countries, with a total sample of 785 Nobel Prize recipients. The countries include Canada, France, Germany, Italy, Japan, Russia, the British Isles, and the USA, and the disciplines in which the individuals were awarded the prizes include chemistry, physics, physiology or medicine, literature, economics, and peace.


Introduction
The Nobel Prize is a prestigious international award that was created in 1901 and is awarded to individuals who make significant contributions to fundamental understanding in the disciplines of physics, chemistry, physiology or medicine, literature, economics, and peace. The Royal Swedish Academy of Sciences awards the Nobel Prize in the disciplines of physics and chemistry, while the Nobel Prize in physiology or medicine is awarded by the Nobel Assembly at the Karolinska Institute in Sweden, and the literature prize is awarded by the Swedish Academy [1]. These three institutions also have special Nobel committees that grant a Nobel Prize in the discipline of economics, a prize that was established in 1968 through a grant from Sweden's central bank, Sveriges Riksbank, to the Nobel Foundation to commemorate the bank's 300th anniversary [2]. The sixth prize, the Nobel Peace Prize, is awarded by the Norwegian Nobel Committee, whose members are selected by the Norwegian Parliament [1].
The history of the Nobel Prize dates back to the Swedish businessman and engineer, Alfred Nobel, who invented dynamite in 1867. This invention was the main source of Nobel's fortune, which was estimated at 31.5 million Swedish kronor [3]. In 1895, Alfred Nobel recommended directing his fortune to the creation of an institute to award prizes to those who contributed significant works that benefited chemistry, physics, physiology or medicine, literature, economics, and/or peace, with each prize named in his honour [4]. Alfred Nobel died in 1896, leaving behind substantial achievements that have contributed positively to the progress of many areas of science and culture and, through his will, directing most of his fortune to the creation of these prizes. The first prize was awarded in 1901, which coincided with the fifth anniversary of Alfred Nobel's death, and from then on, the Nobel Prize has been awarded annually to any individual or team in the specified disciplines. World War II (WWII) affected the distribution of Nobel Prizes: the prize was suspended from 1940 to 1942, and in 1943 it was granted only for achievements in medicine, chemistry, and physics [5]. Between 1901 and 1940, Nobel Prize recipients were predominantly from Germany and France. This changed significantly during WWII, largely due to a mass migration that occurred at the time, resulting in a complete collapse of science in Germany and Eastern Europe after WWII. Despite this, and due to the effects of fascist regimes, the impact of migration on science led to an increase in the number of recipients from the United States [6].
Nussbaum [7] indicated the limited number of women pursuing higher education in research as a potential factor contributing to fewer women Nobel laureates. However, Nussbaum also noted that the proportion of women involved in research was much higher in developed countries than in developing countries [7]. Because research is time-consuming, it was considerably more challenging for women than for men, owing to difficulties in maintaining a family-work balance [8]. In the last 30 years, however, this pattern has changed, resulting in more women taking up research as a profession; modern women have become more career-oriented, with increased involvement of women in science over the last few decades [9].
The purpose of this paper is to analyze the awarding of a Nobel Prize while considering the nature of the association between each recipient's Country of nationality, the Discipline in which the award was received, the recipient's Gender, and the Time period in which the recipient was awarded the prize. This paper extends the study of Alhuzali, Stojanovski, and Beh [10], where attention was confined to the association between only two of the categorical variables, Country and Discipline, by using simple correspondence analysis (SCA). We extend their study by illustrating the applicability of multiple correspondence analysis (MCA) to represent the underlying structures in the data by simultaneously examining the association between Country, Discipline, Gender, and Time. MCA provides a powerful tool for exploring complex Nobel Prize data that cover multiple variables and for gaining deeper insight into how those variables are related. Its main benefit is the ability to represent multiple categorical variables visually, which standard statistical tests do not allow. In Section 2, a brief description of the data is provided (Section 2.1) as well as the specifics of the MCA technique (Section 2.2). Section 3 provides a comprehensive description of the results from the MCA of the data, where Section 3.1 gives a general overview of the Nobel Prize data based on the gender of the recipient during three different time periods (1901-1940, 1941-1980, and 1981-2018). An overview of the Nobel Prize data based on the gender of the recipient is provided in Section 3.2, while Section 3.3 discusses the distribution of Nobel Prizes during these three time periods.
Section 3.4 provides a detailed assessment of the association between the four variables, Country, Discipline, Gender, and Time Period, with concluding statements made in Section 4.

Data
This study examines the awarding of Nobel Prizes from 1901 to 2018 across eight developed countries. The data were obtained from two official websites: the Nobel Foundation [2] and the Encyclopedia Britannica [11]. The study consists of four categorical variables: Country

Statistical analysis
MCA is an extension of SCA in which the nature of the association between more than two categorical variables can be studied visually. The key benefit of MCA is that it is a powerful multivariate statistical technique for analyzing and graphically representing large, complex sets of categorical data; see, for example, Greenacre [12], Greenacre and Blasius [13], and Beh and Lombardo [14] for a discussion of many issues concerning MCA. While these contributions discuss in great detail various aspects of the classical approach to MCA, there is also a wealth of applications that demonstrate the practical benefits of the technique. One may consider, for example, Greenacre [15,16], Greenacre and Pardo [17], Barth [18], and Dungey, Tchatoka and Yanotti [19], which cover the disciplines of medicine, health, gender attitudes, and finance. To describe this approach, consider a multiway contingency table, N, formed from the cross-classification of Q categorical variables, $X_1, X_2, \ldots, X_Q$, that consist of $J_1, J_2, \ldots, J_Q$ categories, respectively. For our multiway contingency table, N, we define the total number of categories as $J = \sum_{q=1}^{Q} J_q$. The first step in applying MCA to a Q-way contingency table is to transform it into a two-way matrix. Here we confine our attention to crisp coding, one of the most commonly implemented coding approaches for MCA. Under crisp coding, either a 0 or a 1 is allocated to each individual according to whether a category of a variable is observed (1) or not (0); see, for example, Beh and Lombardo [14,15]. For our Q-way contingency table, the $n \times J_q$ matrix $Z_q$ is the indicator matrix for the qth variable. This is a matrix of 0's and 1's; each row contains $J_q - 1$ zeros for the categories that were not observed and a single 1 for the observed category.
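Crisp coding can be illustrated with a minimal sketch. The four observations, the category labels, and the helper name `indicator_matrix` below are hypothetical and are not taken from the Nobel Prize data; the sketch only shows how a single categorical variable becomes a 0/1 indicator matrix.

```python
# Minimal sketch of crisp (0/1) coding for one categorical variable.
# The observations and the helper name are hypothetical illustrations.

def indicator_matrix(values, categories):
    """Return the n x J_q indicator matrix Z_q as a list of rows."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]

# Hypothetical Discipline values for n = 4 individuals, with J_q = 3 categories.
discipline = ["Ph", "Ch", "Ph", "Pe"]
categories = ["Ch", "Ph", "Pe"]

Z_q = indicator_matrix(discipline, categories)
for row in Z_q:
    print(row)  # each row holds a single 1 and J_q - 1 zeros
```

Each row sums to 1 because every individual falls into exactly one category of the variable.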
Thus, a super-indicator matrix Z of size $n \times J$ is formed by concatenating the Q indicator matrices such that: $Z = [Z_1, Z_2, \ldots, Z_Q]$. Therefore, the row marginal frequencies of Z are all equal to Q, and the column marginal frequencies of Z are equal to the marginal frequencies of the Q-way contingency table. The total of Z is therefore nQ. The matrix Z will consist of a large number of rows when the sample is very large. To avoid this problem, an alternative strategy is to define and analyze the Burt matrix C [20]. The Burt matrix is of size $J \times J$ and is obtained directly from the indicator matrix Z such that: . . .
where $P_{12}$, for example, is the $J_1 \times J_2$ two-way matrix of proportions for the first and second variables, and $D_1$ denotes the diagonal matrix of marginal proportions for the first variable. Given the symmetric nature of C, an eigen-decomposition can be performed so that: The Pearson chi-squared statistic of the Burt matrix, $X^2_C$, consists of the sum of the chi-squared statistics for each two-way table appearing in the off-diagonal sub-matrices of C and the diagonal sub-matrices holding its marginal proportions, namely: where $X^2_{qq'}$ is the chi-squared statistic for the $(q, q')$ off-diagonal sub-matrix of C [21]. Adopting the terminology commonly used in correspondence analysis, the quantity $X^2/n$ is proportional to Pearson's chi-squared statistic and is referred to as the 'total inertia' of the contingency table; its components are referred to as 'principal inertias' and are often expressed as a percentage of the total inertia [22]. Therefore, the total inertia of the Burt matrix is: The expression after the plus sign (+) represents the total inertia of the sub-matrices that lie along the diagonal of C. Thus, the total inertia of the Burt matrix is artificially inflated by a factor of $(J-Q)/(nQ^2)$; this inflation can be problematic, particularly if the number of categories is much larger than the number of variables. Since the diagonal sub-matrices of the Burt matrix affect the magnitude of the total inertia, Greenacre [22] proposed the method of Joint Correspondence Analysis (JCA) and developed an algorithm to address this issue. His algorithm starts by minimizing where $\hat{C}$ is a low-rank approximation of the Burt matrix C, and where $r_Q$ is the vector of column proportions, $(U_C)^T D (U_C) = QI$, and E is a super-diagonal matrix of "error" terms that are iteratively estimated.
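The construction of the Burt matrix and the way its diagonal blocks inflate the total inertia can be checked numerically. The sketch below uses six hypothetical records on three variables, not the Nobel Prize data, and it assumes a particular scaling: C is taken as the frequency matrix $Z^T Z$, and inertia is computed as the Pearson chi-squared statistic divided by the grand total of the table. Under that assumed scaling the diagonal blocks contribute $(J-Q)/Q^2$; the constant appearing in other presentations depends on how C and $X^2_C$ are normalized.

```python
# Minimal sketch: super-indicator matrix, Burt matrix, and inertia decomposition.
# All records and labels are hypothetical; the scaling (C = Z^T Z, inertia =
# chi-squared / grand total) is an assumption of this sketch.

records = [
    ("USA", "Ph", "M"),
    ("USA", "Ch", "M"),
    ("FR", "Pe", "F"),
    ("FR", "Ph", "M"),
    ("USA", "Pe", "F"),
    ("FR", "Ch", "M"),
]
levels = [["USA", "FR"], ["Ch", "Ph", "Pe"], ["M", "F"]]

n, Q = len(records), len(levels)
J = sum(len(cats) for cats in levels)

# Super-indicator matrix Z (n x J): the Q indicator matrices side by side.
Z = [[1 if rec[q] == cat else 0 for q in range(Q) for cat in levels[q]]
     for rec in records]

# Burt matrix C = Z^T Z (J x J), holding every pairwise two-way table.
C = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(J)]
     for a in range(J)]

def inertia(table):
    """Pearson chi-squared statistic of a two-way table divided by its total."""
    total = sum(map(sum, table))
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2 / total

# Starting column of each variable's block inside C.
starts = [0]
for cats in levels[:-1]:
    starts.append(starts[-1] + len(cats))

def block(q, r):
    """Sub-matrix of C that cross-classifies variable q with variable r."""
    rows = range(starts[q], starts[q] + len(levels[q]))
    cols = range(starts[r], starts[r] + len(levels[r]))
    return [[C[a][b] for b in cols] for a in rows]

# Sum of the inertias of the off-diagonal blocks (both orderings of each pair).
off_diag = sum(inertia(block(q, r))
               for q in range(Q) for r in range(Q) if q != r)

total_inertia_C = inertia(C)
decomposed = (off_diag + (J - Q)) / Q ** 2

print(total_inertia_C, decomposed)  # the two values agree up to rounding
```

The term `(J - Q) / Q ** 2` arises solely from the diagonal blocks, which carry no information about association between different variables; this is the inflation that JCA is designed to remove.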
Thus, SCA is performed on the modified Burt matrix in order to obtain a new modified Burt matrix by replacing the diagonal blocks with estimates from the new solution. This algorithm is repeated until convergence is achieved, and the fit of the off-diagonal blocks is improved with each iteration [23]. Therefore, the total inertia can be expressed in terms of the elements of $D_{\lambda}^{J}$ as: where $\lambda_m^{J}$ is the mth principal inertia (eigenvalue), and M, which is at most $\min(n, J) - 1$, is the maximum number of dimensions needed to visualize the association. Since, for many applications of MCA and JCA, $J < n$, M will be at most $J - 1$, although Greenacre [22] showed that the optimal number of dimensions needed when analyzing the Burt matrix or performing JCA is $J - Q$.

Table 1 is the four-way contingency table of the categorical variables Country and Discipline, based on Gender and Time. By performing a chi-squared test of independence on this table, we find that there is a statistically significant association between the four variables ($X^2 = 107.016$, df = 35, p<0.001). Aggregating by Gender, there exists a statistically significant association between Country, Discipline, and Time (p<0.001). There is also a statistically significant association between Country, Discipline, and Gender (p<0.001). From Table 1, it is noted that the majority of the Nobel Prize recipients are male, accounting for 742 of the 785 prizes (or 94.5%) awarded between 1901 and 2018, with only 43 prizes (or 5.5%) awarded to female recipients. In the following two sections, an exploration of the Nobel Prize data is provided, with a particular focus on the Gender of the recipient and the Time period during which the award was received.
Table 1 shows that, when compared with females in other countries, more females in the USA and France received Nobel Prizes, with notably more female recipients from the USA in the more recent time period (1981-2018), while there were more females from France in the earlier time period. Of all female recipients, 37% were from the USA, with the most prominent disciplines among female recipients in the USA being medicine (50%) and peace (19%). Of all female laureates, 35% were from France, with the highest proportion of French female recipients in the peace discipline (73%). While prizes in peace followed by medicine were the most prominent among female recipients (42% and 23%, respectively), the economics prize was awarded only once to a female (2%), and the remaining prizes among females were distributed evenly between the disciplines of chemistry and physics (9% each), with 14% of female prizes in literature.

Overview of Nobel Prize data based on gender of the recipient
While more males than females are awarded Nobel Prizes, female recipients in France received more of the overall peace prizes (18%) than their male counterparts (2%). Comparing the male recipients across the eight countries, there were more Nobel Prize recipients in the USA than in the other seven countries in chemistry, economics, medicine, peace, and physics, with US recipients receiving 46%, 73%, 51%, 42%, and 48% of these prizes, respectively. However, of the male recipients, those in France dominated the literature Nobel Prize, receiving approximately 27% of them.

Overview of Nobel Prize data based on time periods
From Table 1, it is observed that recipients from Germany received more Nobel Prizes between 1901 and 1940 than any other country, receiving nearly one third (31%) of the prizes awarded during this period. Most of the prizes among German recipients in this period were in chemistry (36%), physics (25%), and medicine (20%). Over this same time period, France received approximately 22% of the Nobel Prizes awarded. Of the prizes awarded to France over this time period, 23% were in each of chemistry, literature, and peace, while 19% were in physics and 13% in medicine. The Nobel Prize was not awarded in economics during this time period. Of the British Isles (BI) prize recipients during this time period, 32% were in physics, while 23% and 19% were awarded Nobel Prizes in medicine and chemistry, respectively. The BI received 19% of Nobel Prizes between 1941 and 1980, the largest share after the USA, which received 50% of the awards over this time period. Most of the prizes to BI and American recipients combined were in chemistry (20%), medicine (33%), and physics (27%). The USA continued with its high level of achievement between 1981 and 2018, receiving 58% of Nobel Prizes awarded, with about 15% of prizes awarded in each of the chemistry, physics, and medicine disciplines.
Between 1941 and 1980, the Nobel Prize in economics was awarded for the first time, with the USA receiving 64% of the total economics prizes awarded during this period, the highest percentage of any country. The USA dominated Nobel Prizes during this period, receiving nearly half of the total awards presented; recipients from the USA were awarded 18% of prizes in medicine, while those in physics and chemistry received 15% and 8%, respectively, of the Nobel Prizes awarded in those disciplines during this period. From 1980 onward, Canada and the USA dominated the economics prize, potentially reflecting their leading role in the neoliberal policies that have shaped the world economic system [24]. Economics is very much considered a USA-based intellectual discipline, with the international graduate study of economics increasingly following and reflecting the model in the USA [24].
Migration is an important contributing factor to the success of America in the Nobel Prize, particularly in the field of economics from 1980 onwards [25]. Immigrants to the United States who won a Nobel Prize rank second only to United States laureates, with their number exceeding the number of laureates born in any other single country [25]. Najam indicated that there are several reasons why America excels in economics, including high immigration rates, world-class education, and the openness of American academics to scholars internationally [25].
There appears to be a Russian and Italian dominance of the Nobel Prize for Literature in the first half of the twentieth century. This dominance appears to diminish later, following the questioning at that time of the value of the Nobel Prize for Literature in Italy, with accusations driven by partisan politics that attracted attention from the award committee, which was speculated to be withholding the award from loyalists to dictatorial and fascist regimes, hence potentially detracting from the literary content of nominated works [26]. Furthermore, it was noted in Engdahl's book, 'Nobel Sensibility', that due to the growing interest in the Literature Nobel Prize, and due to its cultural and literary status, awardees have come to represent modern literary legitimacy, triggering a trend that has caused the removal of many great writers of the twentieth century from the list [27].

Multiple correspondence analysis results
While Section 3.1 provides a numerical summary (through simple percentages) of the association between the variables Country, Discipline, Gender, and Time, the focus will now shift to the visualization of this association. The data in Table 1 is consequently analyzed using MCA, where the calculations and visual summary are produced using the "ca" package of Nenadić and Greenacre [28]. MCA is performed by first transforming Table 1 into its corresponding Burt matrix, and then performing a simple correspondence analysis (SCA) on this Burt matrix. Fig 1 shows the two-dimensional correspondence plot resulting from the MCA. Since the first two dimensions account for 11.6% and 8.8%, respectively, of the total inertia, Fig 1 visually describes only approximately 20% of the association between the four categorical variables.
With such a small percentage, Fig 1 is deemed to be a poor visual summary of the association. To improve the quality of the two-dimensional display, JCA was performed using the "ca" package. The correspondence plot generated from this analysis is provided in Fig 2 and now visually accounts for 81% of the association between the four categorical variables.

Fig 2. The two-dimensional correspondence plot generated from the JCA of the data in Table 1.

To provide a more detailed interpretation of the association, consider the following points that are made based on the results summarized in Table 2. The first three columns of Table 2 summarize the percentage of the total inertia contributed by each category and the principal coordinates for the two dimensions, respectively. The correlation (cor) and the contribution (ctr) of each category are displayed in the fourth and fifth columns. From Table 2, the quality values for the categories USA, Germany, and France of the Country variable are 0.934, 0.898, and 0.863, respectively. Therefore, Fig 2 depicts 93%, 90%, and 86% of the contribution that the USA, Germany, and France, respectively, make to the association between Country, Discipline, Gender, and Time. While these are the best-represented categories in the correspondence plot, the country with the poorest quality of representation is Russia, for which only 10% of its contribution to the association is depicted in Fig 2; the remaining 90% is accounted for by the third and higher dimensions.
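The idea of a category's quality of representation can be sketched with the squared-cosine definition commonly used in correspondence analysis: the share of a category's squared distance from the origin that is captured by the displayed dimensions. The coordinates and category labels below are hypothetical, and whether the quality values in Table 2 were computed in exactly this way for the JCA solution is an assumption.

```python
# Minimal sketch of the quality of representation in a two-dimensional display.
# The principal coordinates (three dimensions per category) are hypothetical.

principal_coords = {
    "A": [0.6, 0.3, 0.1],  # lies mostly in the first two dimensions
    "B": [0.1, 0.1, 0.4],  # lies mostly in the third dimension
}

def quality(coords, n_dims=2):
    """Share of the squared distance from the origin captured by n_dims axes."""
    total = sum(c ** 2 for c in coords)
    displayed = sum(c ** 2 for c in coords[:n_dims])
    return displayed / total

quality_2d = {name: quality(coords) for name, coords in principal_coords.items()}
for name, value in quality_2d.items():
    print(name, round(value, 3))
```

A value near 1 indicates that the point is well represented in the plot, while a value near 0 (as for the hypothetical category "B") indicates that most of its variation lies in dimensions that are not displayed.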
From Fig 2, it is clear that female recipients of the Nobel Prizes from France are strongly associated with the Peace prize, a point that was noted in Section 3.2 when summarizing the association using percentages. Nobel laureates from Italy dominated the Nobel Prizes in literature from 1901 to 1940. There is a strong association between recipients from Canada and the USA and the awarding of the Nobel Prize in Economics, particularly during the 1981-2018 time period. It is also clear that there is a strong association among the categories 1941-1980, M, BI, Me, Ch, and Ph, since they are tightly clustered together. This implies that men in the BI are strongly associated with being awarded Nobel Prizes in the disciplines of medicine, chemistry, and physics during the period 1941-1980. However, since their points lie close to the origin, their contribution to the association is not as dominant as that of other category combinations. Moreover, the categories RU and JP are associated with the Ph category, indicating that laureates from Russia and Japan are strongly associated with the discipline of physics. As mentioned in Section 3.3, the recipients of Nobel Prizes from the US were

Discussion
In this paper, MCA was used to graphically represent and interpret the associations among the Country of the nominated individual (or of the nominated team), Discipline in which the Nobel Prize was awarded, Time period in which the recipient received the award, and Gender of the recipient. Furthermore, JCA was used to improve the quality of the two-dimensional display using the "ca" package. This technique provides a comprehensive description of the association between these variables. This paper also demonstrated the value of MCA for the assessment of temporal data. In this instance, the duration over which Nobel Prizes were awarded was investigated. Three time periods were investigated for the present study: 1901-1940, 1941-1980, and 1981-2018, and the relationship of time period with country and discipline was assessed. The MCA enabled the complex multivariate association between these three variables to be assessed simultaneously, demonstrating how MCA can represent, describe, and analyse temporal data effectively. The results from the MCA show not only a statistically significant association between the variables Country, Discipline, Gender, and Time, but also describe how specific categories are related to each other. However, by the nature of MCA, the variables are assumed to be symmetrically associated; that is, they are all treated as predictor variables. That stated, one may wonder, for example, if a Nobel laureate is female, what impact her gender has on the discipline in which she was awarded a Nobel Prize or on the time period in which the prize was awarded. An analyst may also be interested in examining the impact on discipline or gender of the country a laureate comes from. Therefore, it is important to consider how one may expand the MCA of these data by reflecting upon the asymmetric association structure that may exist (either by design or by the nature of the association being investigated) between categories. Doing so may offer additional insight into the awarding of the Nobel Prize.